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pairwise contact potential. First we used a Lennard-Jones potential to design 
ofF-lattice, protein-like heteropolymer sequences, whose lowest energy (native) 
conformations were then identified by Molecular Dynamics. Then we turned 
to investigate whether one can find a pairwise contact potential, whose ground 
states are the contact maps associated with these native conformations. We 
show that such a requirement cannot be satisfied exactly - i.e. no such contact 
parameters exist. Nevertheless, we found that one can find contact energy 
parameters for which an energy minimization procedure, acting in the space 
of contact maps, yields maps whose corresponding structures are close to the 
native ones. Finally we show that when these structures are used as the initial 
point of a Molecular Dynamics energy minimization process, the correct native 
folds are recovered with high probability. 

PACS numbers: 87.15.By, 87.10. +e 

I. INTRODUCTION 

One of the most challenging open problems in computational molecular biology is the 
prediction of a protein's structure from its amino acid sequence - the so called protein folding 
problem. Assuming that the folded state is thermodynamically stable [^ the problem can be 
formulated as follows: for given sequence of amino acids and interactions between them, find 
the conformation of minimal energy. A successful prediction is marked by the experimental 
validation of the structure. Such kind of prediction is still unfeasible 0. 

In order to tackle the protein folding problem from a theoretical point of view, in practice 
one has to choose 1) a representation of protein structure; 2) an approximation for the 
energy; 3) an algorithm to minimize the energy (i.e. to find the lowest energy structure for a 
given amino acid sequence). A conceptually straightforward approach is to perform detailed 



Molecular Dynamics simulations of fully detailed all atom models. However, numerical 
integration of Newton's equations is unrealistic with present computers on the time-scale of 
the folding process. Even more problematic is the choice of the energy function. An imperfect 
parameterization of the classical effective interactions between atoms could correspond to an 
energy landscape with minima unrelated to the desired ones. 

The true energy function which dictates the folding of proteins is unknown and one must 
resort to simple approximations. A fairly commonly used approximation, that of contact 
energies, was recently proved ^ to be too crude to successfully fold real proteins, whose 
native fold is stabilized by the "true" potential function. It is not known whether this 
failure is caused by the extreme complexity of the true potential, or to the oversimplification 
inherent in the contact energy approximation. Resolving this issue is the main purpose of 
this work. To do this, we pose and study in detail a very simple question: 

Can artificial "proteins", whose constituent residues interact by a Lennard- Jones 
pairwise potential, be folded successfully by a pairwise contact potential ? 

The aim of such a study is to uncover some of the general problems which arise when one is 
forced to resort to energy approximations. 

The Weizmann group has developed the contact map approach to protein folding. The 
contact map 0-§| of a protein with N residues is an A^ x A^ matrix S, whose elements are 
defined as Sij=l if residues i and j are in contact, and otherwise. Contact between two 
residues can be defined in different ways; one is to consider two amino acids in contact when 
their two Ca atoms are closer than some threshold (they used 8.5 rA [0). Another definition 
is based on the minimal distance between two ions that belong to the two residues 0J^. The 
central premise of this approach is that the contact map representation has an important 
computational advantage. Changing a few contacts in a map induces rather significant 
large-scale coherent moves of the corresponding polypeptide chain ITI]. 



In order to carry out the program they had to solve two problems: 



1. Finding an efficient procedure to execute a search in contact map space |11 



2. Developing a test to determine whether the resulting maps correspond to physically 
realizable conformations |^ 

Using these techniques, they have demonstrated that a the simple contact energy approxi- 
mation is unsuitable to assign the lowest energy to the native state even for one protein. |Q 
This last result highlights the fact that the bottleneck in the successful prediction of protein 
structure is the choice of the approximation for the energy. 

The Sissa group has pursued a different strategy developing minimalist off-lattice models 
of proteins. By studying toy protein models they were able to reproduce the essential features 
of the real folding process in terms of stability and accessibility (ffist of all the typical 
"funnel" structure for the energy landscape, that has been shown |[10| to play a critical 
role). They found that the case of a Cq, chain with 4 species of amino acids, equipped by 
a suitable design procedure, is effective to represent global features of the energy landscape 
p!5| . In their model, the potential is the simplest generic off- lattice potential that maintains 
chain connectivity, provides an attractive interaction between monomers and ensures self- 
avoidance: a polynomial potential between beads that are nearest neighbors along the chain 
(to represent the Ca-Ca virtual bond) and a Lennard- Jones (LJ) potential among all other 
beads. This off-lattice toy model of proteins has been a suitable tool to test potential 
extraction techniques |lT7| , p!8[| and, on the other hand, it useful to study of the dynamical 



properties can provide important insights into the real mechanisms acting in real proteins. 
In this work, we investigate whether the contact map representation is as a suitable 
approximation to the LJ model of proteins; that is, whether there exists a suitable choice 
of contact energy parameters, for which the ground states of a pairwise contact potential 
constitute a good approximation to the lowest energy states of the LJ potential. We also 



discuss whether dynamics in contact map space, used in combination with Molecular Dy- 
namics can, at least in the simple case presented here, be successfully used to perform energy 
minimization. 

II. PROTEIN DESIGN WITH LENNARD-JONES POTENTIALS 

A. Definition of the Off-Lattice Model 

In this first part of the study we represent a protein by a chain of A^ monomers or beads, 
representing the Ca atoms of the amino acids. A configuration r of the chain is defined by the 
coordinates ri, . . . , ftv of each Ca atom in three-dimensional continuous space. We consider 
effective pairwise energies between amino acids. The interaction potential Vij between the 
monomers i and j is the sum of a covalent bond term plus a LJ term: 

V^, = 5,,_J{r,,) + (1 - 5,,.,)V^' (1) 

where 



rLJ 



Kr = v^, 






(2) 



and Vij = \ri — Vjl are the interparticle distances. 

For the energy bond function f{rij) we choose the expression: 

/(x) = a{x — doY + h{x — d^Y^ (3) 

that is the fourth-order expansion of a generic and symmetric function of the distance x. 
The parameters a and h are taken to be 1 and 100 respectively. We add a quartic term to the 



usual quadratic one ||20| because a plain harmonic potential could induce energy localization 
in some specific modes, significantly increasing the time needed for equilibration. 

The parameter do represents the equilibrium distance of the nearest neighbors along the 
chain. We set d^ = 3.8, as this is the experimental value for the mean distance between 



nearest neighbor C^ atoms along the chain in real proteins, as taken from the Protein Data 
Bank. 

The total energy of a chain in conformation r is defined as: 

N 2 N 

^ = Ey + EE% (4) 

1=1 i=l j>i 

The first term is the classical kinetic energy of the chain, where the pj's are the canonical 
momenta conjugate to the Fj's. 

B. First step: generation of off-lattice native-like structures 

Off-lattice there are an infinite number of conformations almost all of which are not 
designable (i.e. there is no sequence which has the conformation as its ground state). We 
aim at working with dynamically accessible, designable and compact structures. We present 
here an outline of the procedure; further details can be found in Ref. |T^. The simplest 
way to obtain such configurations is to initially deal with homopolymer chains, setting the 
parameters rjij = t] and aij = a of Eq. (|^) for all the beads in the chain. By cooling such a 
chain below the 6 temperature Q we obtain maximally compact configurations. 

We fix the parameters to the constant values rj = 40 and a = 6.5. The parameter r] 
determines the energy scale, while a determines the interaction length between monomers. 
Such a value for a ensures that, in practice, two monomers significantly attract each other for 
distances smaller then 8 rA, as we show in § |IV| . Such distance threshold is usually used for 
the inter-amino acids bond, and is determined by the requirement that the average number 



^It is known |19] that varying the temperature an homopolymer chain presents two different phases 
according to the dominance of attractive or repulsive interaction energy (as it can be with the LJ 
potential). The transition temperature is usually called 9 temperature. 
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of Ca — Ca contacts for each amino acid is roughly equal to the respective numbers obtained 
with the all-atom definition of contacts 0]. 

We used Molecular Dynamics (MD) (integrating Newton's laws of motion on a computer) 
for simulating the kinetics of the chains. We employed an efficient and precise simplectic 



algorithm [^, in which one varies the energy density e = E/N, which is related to the 



temperature |2^ . In order to find low energy structures of the chain, we used a combination 



of MD and slow cooling. Starting with different, randomly selected, initial conditions we 
find different low energy configurations for the chain, since the minimum energy state of a 
homopolymer is largely degenerate. For each MD simulation the chain is equilibrated at its 
initial energy during 8000 integration steps. We estimate the temperature of the system by 
computing the average value of the kinetic energy of the chain p2[ . Then, after 8000 steps, 
we slowly decrease the temperature by rescaling each component of the momenta by the same 
factor ( < 1). We use the value 0.8 as a cooling factor. Then the chain is again equilibrated at 
the new temperature and the procedure is iterated until very low temperatures are reached. 
At low temperature the chain is "trapped" in the compact configuration of minimum energy. 

We performed 6 times this algorithm and obtained 6 different native-like structures for 
sequences of lengths A^ = 30, whose coordinates will be denoted by r* , (a = 1, . . . , 6). 

We measure the difference between two structures of equal length N, using the root mean 
square (RMS) distance D: 



D{v,.<) 



1 ^ 



M^E(r.-rO^ (5) 



where one structure is translated and rotated to get a minimal D. The standard procedure, 
described in Ref. [|TB| was used. Two structures are considered different if their distance D 
is larger than 1 rA. The mean distance among our 6 different native structures is about 5 
rA. 



C. Second step: protein design 

At this point we need to select, for each chosen compact structure, a sequence having 
its energy niininiuni on this structure. Furthermore, it should be able to reach the selected 
structure in an accessible time. In order to generate such sequences we use the design proce- 
dure of Ref. |]n[. We use 4 types of amino-acids, thus we deal with 10 different parameters rjij 
{i,j = 1, ... ,4). We set these parameters by hand, with the constraint that the inequalities 
2?7a/3 < etttaa + ^.ta^p, {a ^ (3) , are satisfied. The following values have been used: 
?7i,i = 40, ?7i,2 = 772,1 = 30, ?7i,3 = r73_i = 20, r7i_4 = 774,1 = 17, 772,2 = 25, 772,3 = ^3,2 = 13, 
r]2A = Vi,2 = 10, r/3,3 = 5, 773,4 = 774,3 = 2, 774,4 = 1. 

We fix an identical amino acid composition for all the sequences and the values for each 
matrix element rjij. Then we select the optimal sequence for a target structure using a Monte 
Carlo algorithm in sequence space. Starting with a random sequence, at fixed composition, 
we perform random permutations accepting or rejecting new sequences with respect to the 
Boltzmann factor P = exp(— AE'/Zc^T), where AE is the energy variation due to permu- 
tation and T is the "temperature" parameter of the Monte Carlo optimization scheme {ks 
is set equal to 1). By slowly lowering the parameter T we select the sequence that has the 
lowest energy in the target conformation. In such a way we modify the energy landscape: 
out of the multidegenerate minimum energy scenario of the homopolymer we select a single 
minimum configuration lowering its energy by designing an appropriate sequence. 

This method to select the sequence cannot guarantee that the target structures are indeed 



global energy minima. To check the obtained sequences we slowly cool (as in §|TB|) each 
selected sequence about 50 times (each time starting from a different initial condition). The 
cooling simulations find, as the lowest energy configurations, the same structures as before. 
Hence, these target structures are probably the lowest energy structures for the selected 
sequences, at least among the dynamically accessible ones, and this result is confirmed by 



the simulation in contact map space presented below. This repeated cooling process generates 
also a set of alternative, higher energy, metastable configurations (typically 2 — 3 for each 
sequence) that, if perturbed, "decay" to the corresponding global minimum configuration. 

III. DERIVATION OF A SET OF PAIRWISE CONTACT ENERGY 

PARAMETERS 

In the pairwise contact approximation the energy is written as 

N 

EP^''^{8i,S,w) = J2S,M(^,,aj). (6) 

i<j 

We denote by a the sequence of amino acids, by S the conformation (represented by its 
contact map [0]) and by w the set of energy parameters. If there is a contact between 
residues i and j, the parameter w{ai,aj), which represents the energy gained by bringing 
amino acids Oj and aj in contact, is added to the energy. Two amino acids are said to be in 
contact if their C^ atoms are closer than an upper threshold distance Ru whose value will 
be discussed below. 

We now explain in detail the approximation involved in Eq. (j^). Denote by C a micro- 
state of the system. In general the micro-state is specified by the coordinates of all the 
atoms of the proteins and of the water molecules of the solvent. In the present case, we are 
considering a Lennard- Jones model of a Cq, chain, so a micro-state is completely specified 
by the coordinates of the N monomers comprising the chain. Since many microscopic con- 
formations share the same contact map S, it is appropriate to define a free energy H(a, S) 
associated with this sequence and map: 

Prob(S) oc e-^(^'^) = ^ e-^(^)/'=^^A(C, S) (7) 

c 

where A(C, S)=l if S is consistent with C and A = otherwise; i.e. A is a "projection 
operator" which ensures that only those configurations whose contact map is S contribute 
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to the sum (|^. For real proteins E{C) is the unknown "true" microscopic energy, here it is 
the Lennard- Jones energy as of Eq. (|T]). 

Since it is impossible to evaluate this exact definition of the free energy of a map, we resort 
to a phenomenological approach, guessing the form of 7i(a, S) that would have been obtained 
had the sum (|^) been carried out. i^^^°*^(a, S,w) of eq. (P) is the simplest approximation 
to the true free energy. To test the extent to which this approximate form is capable of 
stabilizing the native map of a protein against other non-native maps, we must specify the 
parameters w. 

To derive the set w of pairwise contact energy parameters we require that the following 



set of conditions be satisfied for all the proteins in the database [§],|T^|T8 

^^"^'■(a, S*, w) < ^^'^^"(a, S^, w) . (8) 

Here S* is the contact map of the native conformation of sequence a and S^ (yU = 1, . . . N^) 
are contact maps taken from a set of Nd alternative conformations. The problem is to find 
a set w such that for each decoy /i the inequality (||) is satisfied. The perceptron learning 
technique p3|- p5|] is used to search for contact energy parameters w for which the set of 
conditions (§) are satisfied. The set w that gives the optimal solution of this problem is 
equivalent to the maximization of the gap between the ground state and the first excited 
state, as it is shown in the Appendix. 

A. Generation of alternative conformations 

To obtain a large ensemble of decoys we performed dynamics in contact map space 



collecting maps along the trajectory. The method is presented elsewhere |TT|; here we outline 
only the details that are relevant for the present implementation. Our algorithm is divided 
in four steps. 
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1. We start from an existing map and perform large scale "cluster" moves. At this stage, 
no attempt is made to preserve physicality. The contact map which is obtained by this 
procedure is typically uncorrected to the starting one. 

2. The resulting map is refined by using local moves. 

3. We use the reconstruction algorithm [Q to restore physicality by projecting the map 
obtained from the second step onto the physical subspace. 

4. We perform a further optimization by an energy minimization in real space using a 
standard Metropolis crankshaft technique |]ll . 



Using this algorithm, after the choice of a suitable definition of contact -as detailed in 
IV B| and § [IV C| -, we generated a set of A'^d=60000 alternative conformations, 10000 for 



each of the 6 sequences in the database. As discussed in Ref. |]TT|, the contact maps that are 



obtained by contact map dynamics are uncorrelated and the Nr, decoys form a representative 
set of low energy competitors for the native state. The important requirement is to generate 
a large set of uncorrelated decoys, since a good energy function must stabihze the correct 
structure and destabilize all the others. 

This selection of candidates was performed using the set of LJ parameters rjij as a first 
guess for the pairwise contact energy parameters w. 

IV. SQUARE WELL APPROXIMATION OF THE LJ POTENTIAL 

A. Truncation of the LJ potential 

In the definition of contact one has to specify the upper threshold Ru of the distance 
between two beads, below which are considered to be in contact. To get an estimate of a 
reasonable value of Rjj we studied the truncation of the LJ potential. Using MD, we analyzed 

11 



the stability of the folded conformations against a cut in the tail of the LJ potential. We 
used a potential equal to the Lennard- Jones function for r smaller than Re, and equal to 
zero for r larger than Re'- 



v^Ai^r-i^r) iin,<Rc 



yur^,) = 



(9) 
if Uj > Re 

For a given Re, we determined the ground state conformations of the 6 proteins in the 
database. This was done by performing n runs of Molecular Dynamics minimization (by slow 
cooling, as described in § [11 B| with typically n = 50). For each protein, each minimization 
ended in a slightly different structure t'^{Rc) (a = 1,...,6, z/ = l,n) We measured the 
RMS distance {D{Rc)), averaged over the 6 proteins in the database, between the native 
conformations and the conformations found as minima of the potential of Eq. (§) : 

6 n 

{DiRc)) = T.T.D{r:,r:{Rc)). (10) 

In Fig. |l| we show the results for each protein, for several choices of the cutting length Re- 
In the inset the average distance {D{Rc)) with its standard deviation as a function of the 
cutting length Re are shown. Remarkably, it is possible to cut the LJ potential inside the 
attractive well down to -Rc=8 rA, still keeping the average distance below 1 rA. 

B. Derivation of the region of physicality 

For any value of Ru the contact map S of any chain conformation can be easily deter- 
mined. In order to construct decoys, however, we must deal with the inverse problem - to 
get a chain conformation starting from some contact map S. In order to apply the recon- 
struction procedure introduced in Ref. [@ for a possibly non physical contact map one has to 
specify another length; the hard core distance Rl, defined as the minimal allowed distance 
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between two amino acids. This is essential since without such a lower cutoff we would be 
able to construct chain configurations for maps with much higher numbers of contacts than 
the native maps and hence would identify such dense decoys as physical. Since the contact 
energies tend to be negative such decoys will have much lower energies than the native map 
and will never be able to find a solution to the inequalities eq. ( |A1|) . 

In Fig. 1^ we show the histogram of all the pairwise distances for the six native con- 
formations considered in this study. Apart from the peak at 3.7 rA, due to consecutive 
residues along the chain, the distances range from 6.1 to 18.8 rA. The vertical line indicates 
6.9 rA, a distance whose role will be discussed below. There are 2436 distances between 
non-consecutive beads, of which 49 are below 6.9 rA and 7 below 6.5 rA. 

We now consider the plane {Ru, Rl) and set about to determine the "region of physical- 
ity" which is that part of the plane whose points allow reconstruction of the contact maps 
that correspond to the six native folds. To understand the problem we are facing, one should 
bear in mind that once we chose a value for Ru, the contact map associated with a confor- 
mation is determined. When, however, we try to reconstruct one of these maps, we must also 
specify a value for R^ and if the chosen value is too high, we may find that no corresponding 
chain configuration exists. In this case, according to definition, the native contact maps are 
non-physical. The gross features of such a region can be outlined by some simple arguments. 
First, since we must have Ru > Rl, the region of physicality must clearly lie below the line 
Ru = Rl- Second, if Rl is too high, the contact maps are non physical and there must be a 
(possibly horizontal)line that bounds the physical region from above. The exact location of 
this line must be determined numerically. For Rl < 6.1 it is clear that all the native contact 
maps are physical, but for R^ > 6.1 rA the situation is not clear. It may be able to "catch" 
the few "close" contacts (those with d < 6.9 rA) by locally stretching the chain. We proved 
numerically that this is indeed possible up to about Rl < 6.7 rA, as shown in Fig. |^a and 
in more detail in Fig. ^. The physical region is the shaded trapezoid open on its right side; 
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for points in this region we are able to reconstruct chain conformations for all the six native 
maps (derived from their "true" chain conformations by the choice of Ru)- 

C. Derivation of the region of learnability 

We turn now to determine a second region in the {Ru, Rl) plane, that of " learnability". 
At a point {Ru,Rl) in this region it is possible to find a set of energy parameters w such 
that each of the six native maps have the lower energy than all their respective decoys. As 
was done above for the region of physicality, it is possible to predict the general shape of 
the region of learnability a well. First, it is limited by the same bisecting line Ru = Rl- 
Second, there is a vertical line at Ru = 6.1 rA, at the left of which all the contact maps 
are empty - we are not interested in such a case. Third, there is a second vertical line, 
at Ru = 19, at the right of which all the contact maps are filled. Also this case is not 
interesting. Fourth, there must be a line (approximately horizontal) below which the energy 
parameters are unlearnable. We expect this since for small values of Rl amino acids are 
allowed to be very close to each other. Very compact conformations are possible and since 
the interaction parameters are on the average attractive, such compact conformations have 
very low energy. The exact location of this line must be determined numerically. The result 
is shown in Figs. ^ and §3. For each choice of Ru and Rl, we generated by contact map 
dynamics a set of Nd = 60000 decoys, 10000 for each sequence in the database. Using the 
perceptron algorithm, we determined if such set of decoys was learnable or not. The outcome 
of the procedure is the triangular shaded region in Fig. ^, whose lower edge is shown in 
detail in Fig. ^o. 

The most interesting result is that the region of physicality and the region of learnability 
do not overlap. It is not possible to choose {Ru, Rl) such that the contact maps of the 
native LJ conformations in the database are physical and they can be stabilized by a pairwise 
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contact potential. 

Two observation must be made. First, the points labeled as "unlearnable" are rigorous, 
whereas those labeled as "learnable" are tentative - it is possible that by increasing the 
number of decoys also those points will become unlearnable. The lower edge of the region 
of learnability could actually be moved upwards. The same effect should be expected by 
increasing the number of proteins in the database. Second, the points labeled as "physical" 
are rigorous, whereas those labeled as "non physical" are tentative - we could not reconstruct 
the corresponding contact maps, but this failure could be due to our reconstruction algo- 
rithm. Thus, the upper edge of the region of physicality could be moved upwards. However, 
by increasing the number of proteins in the database we expect that this upper edge would 
move downwards. In the limit of a large number of proteins the latter effect could possibly 
dominate. In conclusion, we argue that the gap between the region of learnability and the 
region of physicality is expected to widen when many proteins are taken into account. 

V. FOLDING IN CONTACT MAP SPACE 

Since we have proved that the two regions, of learnability and physicality of the native 
LJ contact maps do not overlap, we can conclude that the answer to the question posed in 
the Introduction is negative - the contact energy approximation is unable to stabilize the 
native folds of LJ chains. The contact map approximation is too crude also for extremely 
simplified potential as the LJ potential used in this work. 

The situation is, however, more subtle. We can look for contact energies that give 
rise to approximate solutions of the folding problem: it is impossible recover the LJ native 
configurations using the contact map approximation, but we want investigate how "close" 
to the solution the contact map approximation can lead. 

To tackle this point, we selected a working point in the learnable region. We chose Ru=8 
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rA, according the the minimal value of Re determined in Sec. IVA and R^ = 6.9 rA. As 
shown above, for this value of Rl, native maps are non physical. In other words, for a given 
native map there is no chain which can realize that contact map. We found, however, that 
for these Rl and Ru it is possible to reconstruct conformations whose maps are physical and 
typically differing by only a few contacts from the native map. Moreover, the corresponding 
3D structures are, on the average, at a RMS distance of less that 2 rA from the native LJ 
conformations. We decided to choose Rl = 6.9 rA, as the best possible compromise between 
solving Eq. (§) and our aim is that some physical maps of low energy are close to the native 
maps. 

Now we turn to answer the following important question: is the set w obtained this way 
useful to fold LJ proteins? The first aspect of this general question concerns the feasibility 
of using contact map dynamics. That is, we ask: 

Is it possible to fold the sequences in the database using contact map dynamics? 

Fully successful folding in contact map space corresponds to exact recovery of the native 
map. Since for R^ = 6.9 rA and Ru = 8.0 rA the native contact maps are non physical, 
such an exact solution is not within our reach. Nevertheless, the contact map dynamics 
could select configurations slightly different from the native ones. 

For each of the 6 sequences we started from some random contact map and ran the 
contact map dynamics procedure for 1000 steps; this took about 3 cpu hours on our HP812 
computer. We now analyze in detail our results. 

As already observed in Ref. [Q], the correlation between energy and distance from the 
native map (the "funnel") is not always very strong when using _£;?"*''. A case of quite good 
correlations, obtained by contact map dynamics for sequence N° 5 (among the six used in 
this work), is shown in Fig. ^ Since the lowest energy contact map found in the simulation 
might or might not coincide with the closest in distance to the native one, in general we can 
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obtain only a short list of candidates for the native state. 

For sequence N° 5 the physical map of lowest energy is, in fact, the closest in distance to 
the native one. Both maps are shown in Fig. ^. The energy E^"-^"^ of the native map is -35.75 
and the energy of the predicted one is -33.33. The RMS distance between two corresponding 
conformations, shown in Fig. ^ is 2.1 rA. We emphasize again the obvious fact that since 
our w is a solution of (|A1|) , the native map has lower energy than all the low-energy decoys 
that were used for learning. Indeed, in no case have we found by contact map dynamics 
maps lower in energy than the native one. We summarize in Table | the results obtained for 
the six sequences. 

The Hamming distance Dh between two contact maps S and S' is defined by 

IS*-- — 5" I 
"" = g iV.(S)iV.(S') • <i^' 

which counts the number of mismatches between maps S and S'; Nc{S) and A^c(S') are the 
numbers of contacts in the two maps. 

The second and complementary question we pose is 

Are the low energy conformations generated by contact map dynamics good start- 
ing points for Molecular Dynamics minimization? 

We analyze again sequence N° 5. We randomly selected 10 conformations among the 100 
of lowest EP"-^^ energy that were generated by contact map dynamics and reconstructed 
their corresponding conformations. These were then used as starting points for a Molecular 
Dynamics minimization of the E^"^ energy. In Fig. |^ we present the energy and distance to 
the native conformation for the conformations obtained by MD. As can be seen, 8 trajectories 
ended up very close to the native state and 2 in conformations at RMS distances of 6 and 
8 rA respectively. We note that these last two initial conformations had the largest RMS 
distance from the native state. 
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We conclude that the predictions performed in contact map space using the approximate 
energy E^"-'^'' are, on the average, suitable starting points for a Molecular Dynamics mini- 
mization of the L J energy. Moreover, by using Molecular Dynamics it is possible to correctly 
rank the predictions that were obtained by contact map djTiamics. 

VI. CONCLUSIONS 

There are four different situations for which the contact pairwise approximation has been 
used: 

1) When the true potential is a contact pairwise one, such as for lattice models with 
nearest-neighbor interactions, the energy E^"-^^ of Eq. (^ is, by definition, the exact (free) 



energy. Models of this type were investigated in two recent studies [p!8| , p6| . In both works, a 
database of proteins was designed using E^"'^^, with a given choice of energy parameters w*™^. 
The database was then used to recover a set w, which ideally could be identical to w*™*^. 
Since the native states of the proteins in the database are the lowest energy conformations of 
£ipair ^^true^ ^ Eq. (||) cau bc Satisfied for all the possible conformations for all the sequences 
in the database. 

2) When the true potential is unknown, such as for real proteins, Eq. (|^) is only an 
approximation to the true free energy (|^), and there is no guarantee that such an approxi- 
mation would lead to good results. The conformations stabilized by the true energy function 
are not necessarily stabilized against all decoys by a pairwise contact energy function. For 
a particular protein, crambin, it has been shown [Q that, in fact, such a solution does not 
exist and the contact maps of lowest pairwise energy are not close to the native maps. 

3) The "true" potential is a contact pairwise one, and it is used to design protein sequences 
on the structure of existing proteins. This case was considered in Ref. p6[, where it was shown 



that it is unfeasible to design sequences, using a pairwise contact potential, on structures 



of existing proteins. This result implies that the native structures of existing proteins are 
"designed" in a very peculiar way by the possibly very complex potential used by Nature. 
A simple pairwise contact potential can neither be tuned to stabilize the true sequences (see 
point 2) nor used to perform protein design on the true structures. 

4) The true potential is a Lennard- Jones pairwise one and it is used to design sequences 
on structures typically arising from Lennard- Jones interactions. The resemblance of such 
structures to those of real proteins appear to be reasonable |]l5l, however, as discussed in 



this paper, the results obtained for approximate contact potentials are remarkably different. 
Here we have shown that it is not possible to stabilize the exact native folds. This result 
is somewhat surprising - the approximation seems to be quite reasonable, since both the 
contact potential and the LJ potential are characterized by a single length scale. However, 
we also showed that in this case an approximate way to solve the folding problem can be 
found. We showed that a dynamics in contact map space is a suitable tool to perform 
energy minimization, when a correct parameterization of the energy is possible. Contact 
map dynamics provides, for a good choice of contact parameters, not just a unique structure 
corresponding to the lowest energy state, but a set of candidates for it. When this list of 
predictions is used as starting configurations of a MD energy minimization, which uses the 
original "true" LJ form of the energy, the correct ground states are recovered. 

It remains to be investigated if a hybrid method, based on "fusion" of contact map 
dynamics and molecular dynamics, in a way that takes advantage of the their respective 
advantages (that in some sense are complementary) may be a powerful tool to fold real 
proteins as well. 
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APPENDIX: PERCEPTRON LEARNING 

We describe here the perceptron learning technique that was used. 

We first show that for any conformation the condition Eq.(D can be trivially expressed 

as 

w ■ x^ > (Al) 

To see this just note that for any map S^ the energy (^ is a linear function of the 10 contact 
energies that can appear and it can be written as 

10 

E^"^'-(a, S^, w) = ^ iV,(a, S^)w, (A2) 

c=l 

Here the index c = 1, ...10 labels the different contacts that can occur and Nc{si, S^) is the 
total number of contacts of type c that actually appear for the sequence a in the map S^. 
The difference between the energy of this map and the native S* is, therefore, 

10 

AE^ = 5]a;>e = w-x^ (A3) 

c=l 

where we used the notation 
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x^ = iV,(a,S^)-iV,(a,S*) (A4) 

and S* is the native map. 

Each candidate map S^ is represented by a vector x^ and hence the question raised above 
regarding stabihzation of S* for a sequence a becomes 



Can one find a vector w such that condition (|A1| ) holds for all x' 



r/'? 



If such a w exists, it can be found by perceptron learning. 

A perceptron is the simplest neural network |2^ . It is aimed to solve the following task. 
Given a set P of patterns (also called input vectors, examples) x^, fi = 1,...,P, find a 
vector w of weights, such that the condition 

/i^ = w ■ x^ > (A5) 

is satisfied for every example from the set. If such a w exists for the training set, the problem 
is learnable; if not, it is unlearnable. We assume that the vector of "weights" w is normalized, 

w ■ w = 1 (A6) 

The vector w is "learned" in the course of a training session. The P patterns are pre- 
sented cyclically; after presentation of pattern /x the weights w are updated according to the 
following learning rule: 



w' 



w+»7X 



M 



if w ■ x„ < 



(A7) 



w otherwise 

This procedure is called learning since when the present w misses the correct "answer" 
h^ > for example /i, all weights are modified in a manner that reduces the error. No 
matter what initial guess for the w one takes, a convergence theorem guarantees that if a 
solution w exists, it will be found in a finite number of training steps. [^,0 . 
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If the region of parameter space whose points are solutions is large, one is interested in 
the optimal solution. To obtain the optimal perceptron solution, called the maximal stability 
perceptron, we used the Krauth and Mezard algorithm [^. In this algorithm the condition 
(|X5|) is replaced by 



/i^ = w ■ x^ > c (A8) 

where c is a positive number that should be made as large as possible. 

At each time step the "worst" example x.'^ is identified, namely the one such that 

h,y = w ■ x.'^ = min w ■ x^ (A9) 

and the example u is used to update the weights according to the rule (|A7|) . The field hi^it) 
keeps changing at each time step t and the procedure is iterated until it levels off to its 
asymptote. 
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Captions to the figures 

Fig. 1. Upper part: Approximation of tlie Lennard- Jones potential by a contact poten- 
tial. 

Lower part: Distance of the conformation of lowest energy found in a MD simulation, done 
with a potential truncated at Re-, to the true native state of the non truncated LJ interac- 
tion. Each dot represents a different run. In the inset the average {D{Rc)) is shown. Down 
to Re = 8 rA the distance to the true native state is below 1 rA. 

Fig. 2. Distance probability distribution between amino acids in the six native confor- 
mations in this study. 

Fig. 3. Regions of physicality and of learnability in the {Ru,Rl) plane. 

Fig. 4. Correlation between energy E'^"-"' and RMS distance D to the native state and 



between energy and the Hamming distance Dh (see Eq. ([TT])). Results refer to contact map 
dynamics. 

Fig. 5. Contact maps of sequence N° 5: native (below diagonal) and of lowest energy 
found during the simulation (above diagonal). 

Fig. 6. Native conformation of sequence N° 5 and its lowest energy conformation found 
during the simulation. The RMS distance is 2.1 rA. 

Fig. 7. MD folding trajectories of sequence N° 5. Each trajectory (small connected 
full dots) is obtained starting from 10 conformations (large full dots) randomly chosen from 
the 100 conformations of lowest energy obtained by contact map dynamics. Conformations 
(empty large dots) collected during other MD simulation (see § [1I C|) are also shown as refer- 
ence. 
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TABLES 



sequence 


1 


2 


3 


4 


5 


6 


Eo 


-34.78 


-36.39 


-36.56 


-36.29 


-35.75 


-36.17 


E 


-33.53 


-35.17 


-34.87 


-35.60 


-33.33 


-33.59 


D 


4.2 


4.2 


2.1 


2.2 


2.1 


4.3 


Dh 


0.60 


0.61 


0.31 


0.34 


0.25 


0.60 



TABLE I. Summary of the results of the folding experiments in 
contact map space on the six sequences, using the pairwise contact pa- 
rameters of maximal stability. Eq is the energy of the native contact 
map. E is the energy of the contact map of lowest energy found during 
the simulation, D is the RMS distance to the native state conformation 
and Dh is the Hamming distance to the native contact map. 
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